This page contains a list of Web archiving initiatives worldwide. For easier reading, the information is divided into three tables: web archiving initiatives, archived data, and access methods.
Name | Country | Creation Year | Technologies | Number of Employees (Full-time) | Number of Employees (Part-time) | Comments |
---|---|---|---|---|---|---|
Australia's Web Archive[1] | Australia | 1996 | PANDORA Digital Archiving System (PANDAS), NLA Trove, HTTrack. | 4 | >4.25 | It is a collaborative program of 11 agencies that provide an estimate average monthly staffing equivalent to 4 FTE. IT outsourced support: 0.25 person-month. Whole Domain Harvests are conducted by the Internet Archive using Heritrix, Wayback Machine. |
Our digital island, a Tasmanian Web Archive[2] | Australia | 1996 | HTTrack, Experimentally: Web Curator, Heritrix and Wayback Machine | 1 | ||
PageFreezer [3] | Canada, US, Netherlands, Belgium | 2005 | PageFreezer's Deep Web Crawler, Lucene, Solr | Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. | ||
Web@rchive Austria[4] | Austria | 2008 | Archive-access tools and NetarchiveSuite.dk | 2 | ||
DILIMAG (Digital Literature Magazines)[5] | Austria | 2007 | WebCurator | 2 | One technician, one for collecting and metadata. | |
Government of Canada Web Archive (GCWA)[6] | Canada | 2005 | Heritrix, Wayback Machine and Nutchwax. | 2 | ||
Web Information Collection and Preservation - WICP (Chinese Web Archive)[7] | China | 2003 | Heritrix, Wayback Machine and Nutchwax. | |||
Croatian Web Archive (Hrvatski arhiv weba - HAW)[8] | Croatia | 2004 | Lucene | 4 | 3 | 2 librarians full time, 2 librarians part time, 1 IT professional (National and University Library in Zagreb), 1 or 2 IT professionals (from Zagreb University Computing Centre (Srce)- our partner) |
WebArchiv (National Library of the Czech Republic)[9] | Czech Republic | 2000 | Nutch, NutchWAX and WERA tools. | 5 | 3.5 FTE library staff + approx. 1.5 FTE technical staff | |
Netarkivet.dk[10] | Denmark | 2005 | NetarchiveSuite.dk and Heritrix. | 18 | 18 people involved (developers, librarians, operations staff, project managers). All together 5 FTE. | |
Finnish Web Archive[11] | Finland | 2008 | NutchWAX | 2 | >2 | A group of librarians selects, on a part-time basis, what to archive from the Finnish web space. |
BnF - BnF Web Legal Deposit[12] | France | 2006 | Heritrix, Wayback Machine and NutchWAX. NetarchiveSuite. | 9 | ||
Ina (Institut National de l'Audiovisuel)[13] | France | 2009 | Crawl : PhagoSite, Croket, Heritrix / Access : Dowser | 6 | Staff of 80 documentalists taking part in nominating sites and QA | |
E-diaspora (Télécom ParisTech, FMSH)[14] | France | 2010 | Crawl : PhagoSite | 1 | 30 researchers taking part in nominating sites | |
Internet Memory Foundation (ATN service)[15] | France, Netherlands | 2004 | IM large scale crawler (under development), Heritrix, Hanzo's crawler, IM Access software. Storage of Web Content: Hbase | 21 | 0 | 11 people for quality crawls (QA, crawl engineering, project management), 9 developers & infrastructure, 1 manager. |
Bibliotheksservice-Zentrum Baden-Württemberg[16] | Germany | 2003 | 7.5 | |||
Web archive of the German Bundestag[17] | Germany | 2005 | ||||
Iceland[18] | Iceland | 2004 | Heritrix, Wayback Machine | |||
Japan Web Archiving Project[19] | Japan | 2004 | Heritrix, Solr. Previously: Wget, Accela BizSearch | 10 | 2 | Launched in April 2004 as a pilot project, WARP (Web Archiving Project) has been in full-scale operation since July 2007.[20] |
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources)[21] | Korea | 2001 | Own system based on Oracle DBMS and specialized search engine (IRS) that performs data management and search function. | 3 | 11 | |
Koninklijke Bibliotheek[22] | Netherlands | 2006 | Heritrix, KB e-Depot system | 1 | ~7 | |
National Library of Latvia[23] | Latvia | 2005 | Heritrix | 1 | Currently only storing for preservation; public access is in development (ETA June 2012). The Latvian term for web harvesting is "rasmošana". | |
New Zealand Web Archive[24] | New Zealand | 1999 | Wayback Machine | 3 | >10 | 3-4 people at the National Library (various hours) and 2 people at the Internet Archive during the time of domain harvests. Selective web archiving = 3 full time staff. Technical services = 1 staff member responds to technical problems when they arise. National Digital library = 2-3 staff members ad hoc. NDHA (National Digital Heritage Archive) = various staff members respond to web archiving issues as they arise. |
The National Library of Norway[25] | Norway | |||||
Portuguese Web Archive[26] | Portugal | 2007 | Heritrix, Wayback Machine, NutchWAX | 4 | 1 | |
Web archive of Čačak[27] | Serbia | 2009 | HTTrack | 1 | ||
Web Archive Singapore[28] | Singapore | Wayback Machine, Heritrix, NutchWAX, WERA | ||||
Slovenian Web Archive[29] | Slovenia | 2007 | Heritrix, Wayback Machine | 1 | ||
Digital Preservation of .ES domain[30] | Spain | 2006 | Internet Archive | 2 | >2 | Can pool additional resources if necessary from computing controllers and financial department. |
Digital Heritage of Catalonia[31] | Spain | 2006 | Heritrix, Wayback Machine, WERA, Nutchwax and Web Curator. | 4 | ||
Basque Digital Heritage Archive[32] | Spain | 2008 | Heritrix, Wayback Machine, Nutchwax and Web Curator. | 1 | ||
Sweden (Kulturarw3)[33] | Sweden | 1996 | Heritrix. Own system for storage, maintenance and access | 1.25 | Pause in operation from November 2009 to May 2011. | |
Aleph Archives[34] | Switzerland/USA | 2010 | Distributed crawler, ArchiView access plugin, High performance search engine, Near real time indexing, Web Monitoring tools | 7 | Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and legal and government industries seeking to preserve their web content regardless of its type (websites, wikis, social media, forums...). | |
Web Archive Switzerland[35] | Switzerland | 2008 | Heritrix, Wayback Machine | 3 | 1 crawl engineer, 1 person for quality assurance, 1 coordinator. The curators, who do the selection, are partner libraries all over Switzerland. | |
NTU Web Archiving System, NTUWAS[36] | Taiwan | 2007 | Lucene | 3 | ||
Web Archive Taiwan[37] | Taiwan | 2007 | ||||
The UK Web Archive[38] | UK | 2004 | Heritrix, Web Curator Tool, Wayback Machine and moving to Solr for searching. | |||
Hanzo Archives[39] | UK | 2006 | Hanzo Crawler, Search, and Access Tools. | Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | ||
UK Government Web Archive[40] | UK | 2004 | ATN Service | 4 | 2 | Technical side of our web archiving operation is contracted out to the Internet Memory Foundation so the figures account for QA, curatorial and management staff only |
Internet Archive (provides Archive-it service)[41] | USA | 1996 | Heritrix, Wayback Machine, NutchWAX and other tools developed by the Internet Archive | 12 | ||
Reed Technology Web Archiving Services[42] | USA | 2010 | TrueArchive™ Technology | Reed Technology Web Archiving Services provides support for Litigation Protection, Compliance, e-Discovery and Social Media Management. | ||
Columbia University Libraries Web Resources Collection Program[43] | USA | 2009 | Archive-it service | 3 | >1 | Part-time consultation/supervision from other librarians adding up to about 1 FTE. |
North Carolina State Government Web Site Archives[44] | USA | 2005 | Archive-it service | 3 | ||
Latin American Web Archiving Project[45] | USA | 2005 | Archive-it service | |||
Web Archiving Project for the Pacific Islands[46] | USA | Archive-it service | 4 | |||
Library of Congress Web Archives[47] | USA | 2000 | Heritrix, Wayback Machine, and the DigiBoard, an in-house curatorial/permissions tool | 6 | 80 | The part time workers spend a few hours per month (on average) selecting content for the collections. |
Harvard University Library: the Web Archive Collection Service (WAX)[48] | USA | 2006 | Own system based on Archive-access and other open-source tools. | >6 | 3 part-time staff on IT support. External curators within 3 units, whose size is unknown. | |
Web Archiving Service from California Digital Library (WAS service)[49] | USA | 2005 | Heritrix, Wayback Machine, NutchWAX | 4 | >1 | The number of hours that curators devote to the service is very variable. |
University of Michigan Web Archives Project[50] | USA | 2000 | WAS service | 2 | ||
University of Texas at San Antonio Web Archives[51] | USA | 2009 | Archive-It | 3 | The number of hours varies dependent upon how the crawls are scheduled. | |
qumram[52] | Switzerland | 2010 | Chronos Web Archiving Software Suite | Commercial web archiving software suite. Provides both harvesting and transactional web archiving. Allows integration with any repository (database, file system, electronic archive or records management system). Specializes in regulatory compliance. | ||
SAPERION[53] | Germany | 2011 | SAPERION ECM Web Content Archive | Commercial enterprise content management suite specializing in regulatory compliance. The product provides both harvesting and transactional web archiving, based on the integration of qumram's[52] Chronos Web Archiving Software Suite. Web content is just another channel through which content reaches SAPERION. Others include scanners, fax, e-mail, mobile devices, office suites, or any other content-creating system such as ERP systems. | ||
Bibliotheca Alexandrina's Internet Archive | Egypt | 2002 | Heritrix, Wayback Machine | 3 | Current crawling interests: Egypt beyond January 25, Arab League ccTLDs |
Name | Archived Contents (millions) | Disk Space Occupied (TB) | Archive Format | TLD/Broad Crawls | Selective Crawls (Yes/No) | Comments | |
---|---|---|---|---|---|---|---|
Australia's Web Archive[1] | 3100 | 104.5 | ARC/WARC | .AU | Y | .AU crawls (2005-2009): 3 billion files (100 TB). Selective crawls (1996-today): 100 million files (4.5 TB). There are 3 copies of each content. | |
Our digital island, a Tasmanian Web Archive[2] | 0.336 | HTTrack | Y | Preserves online contents related to Tasmania. ODI has operated since its inception under the assumption that web sites fall within the definition of ‘Book’ in the Tasmanian Library Act 1984.[54] Thus, no permission to capture from publishers is required. | |||
Web@rchive Austria[4] | 455 | 6.61 | ARC | .AT | Y | A copy of the data will be stored in a high security data storage unit. | |
DILIMAG (Digital Literature Magazines)[5] | 0.03 | 0.996 | ARC | Project from 2007-03-01 until 2010-12-23. The project DILIMAG for collecting, describing and archiving of digital German literary magazines. | |||
Government of Canada Web Archive (GCWA)[6] | 170 | 7 | Y | Selective crawls of the web domain of the Federal Government of Canada (.GC.CA) | |||
Web Information Collection and Preservation - WICP (Chinese Web Archive)[7] | .GOV.CN | Y | Harvest of the web pages about the events that have great influence on the society, economy and so on, and the sites in 'gov.cn' domain. | ||||
Croatian Web Archive (Hrvatski arhiv weba - HAW)[8] | 81 | 3.4 | Y | ||||
WebArchiv (National Library of the Czech Republic)[9] | 526 | 24 | .CZ | Y | Harvesting began in 2001. | ||
Netarkivet.dk[10] | 6008 | 190 | ARC/WARC | .DK | Y | It uses Heritrix and NetarchiveSuite.dk, which was developed by two Danish libraries. | |
Finnish Web Archive[11] | 494 | 23 | .FI, .AX | Y | Also crawls contents hosted on machines physically located in Finland, independently from their domain. | ||
BnF - BnF Web Legal Deposit[12] | 14000 | 200 | ARC/WARC | .FR | Y | ||
Ina (Institut National de l'Audiovisuel)[13] | 8400 | 56 | DAFF | N | Y | Digital Archive file format handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 665 TB. | |
E-diaspora (Télécom ParisTech, FMSH)[14] | 237 | 2 | DAFF | N | N | Digital Archive file format handles file redundancies. The size on disk takes into account compression and deduplication; the equivalent disk storage in compressed ARC format would be 10 TB. | |
Internet Memory Foundation (ATN service)[15] | 180 | WARC | Can be done by partners | Y | Formerly European Archive.[55] Provides the Archive The Net Service (ATN Service). Selective crawls (140 TB), Domain crawls (40 TB), expect to grow to 1PB in 2011. New datacenter and a new crawler in 2011. | ||
Bibliotheksservice-Zentrum Baden-Württemberg[16] | 1 | HTTrack | Y | Bibliotheksservice-Zentrum Baden-Württemberg operates the following web archives: (1) Baden-Württembergisches Online-Archiv (BOA); (2) Saardok; (3) Literatur im Netz des Deutschen Literaturarchivs Marbach.[56] | |||
Web archive of the German Bundestag[17] | Y | German Federal Parliament. Selective. Snapshots of www.bundestag.de and other web presences of the German Bundestag are made at regular intervals or on the occasion of certain events. These snapshots are available in the web archive. | |||||
Iceland[18] | |||||||
Japan Web Archiving Project[19] | 319.8 | 38.2 | WARC | - | Y | 15 TB of selective crawls based on permission (2002–2010). Started the web archiving of official institution sites based on the legislation from April 2010. | |
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources)[21] | 24 | Y | Requires consent before archiving. Targets 56,401 websites. Web archiving is managed under digital resource management systems. In 2011 the web archiving system will be rebuilt. | ||||
Koninklijke Bibliotheek[22] | 5 | ARC | Y | ||||
New Zealand Web Archive[24] | 346 | 13 | .NZ | Y | .NZ crawls: 105 million URLs (4.1 TB) in 2008, 170 million URLs (6.1 TB) in 2010. Selective crawls of 7 599 websites in the National Digital Heritage Archive (2.8 TB), 71 million contents estimated. Legal deposit covers born digital material (including websites). | ||
The National Library of Norway[25] | |||||||
Portuguese Web Archive[26] | 889 | 25 | ARC | .PT, .CV, .AO, .MZ | Y | TLD crawls and integration of external collections since 2007, selective crawls since 2010. | |
Web archive of Čačak[27] | 0.255 | 0.013 | HTTrack | Y | Selective crawls of 130 sites related to the city of Čačak. Collaboration with the WebArchiv team from the National Library of the Czech Republic. | ||
Web Archive Singapore[28] | .SG | Y | Selective crawls of 1000 Singapore-related sites, with the written consent of the owners. Whole .SG domain archiving. | ||||
Slovenian Web Archive[29] | 1.5 | WARC | Selective crawls | ||||
Digital Preservation of .ES domain[30] | 855 | 30 | ARC | .ES | Collaboration with Internet Archive. Domain crawl of .ES, harvested quarterly. Not launched publicly yet. | ||
Digital Heritage of Catalonia[31] | 200 | 7.7 | ARC | .CAT | Y | In accordance with the general trend, the archive model is a hybrid system consisting: Mass compilation of open-access digital resources published on the Internet (.cat); Systematic archiving of the web site output of Catalan organizations; Fostering of lines of research through themed integration of the digital resources pertaining to specific events in Catalan public life (elections, museums, etc.) | |
Basque Digital Heritage Archive[32] | 21 | 0.8 | ARC | Y | |||
Sweden (Kulturarw3)[33] | 1710 | 71.3 | Multipart MIME | .se, Swedish .nu and geolocation for other TLDs | Y | Bulk crawls approximately twice a year. Selective crawls of about 140 newspapers every day. | |
Aleph Archives[34] | 23 | WARC, WARC2, ARC and HTTrack to WARC migration tools | Y | Enterprise-grade Web archiving platform for online heritage (content, brands) preservation and eDiscovery, aimed at corporations, institutions, and legal and government industries seeking to preserve their web content regardless of its type (websites, wikis, social media, forums...). | |||
Web Archive Switzerland[35] | 0.1 | ARC | Y | ||||
NTU Web Archiving System, NTUWAS[36] | 200 | 14 | Y | ||||
Web Archive Taiwan[37] | |||||||
The UK Web Archive[38] | 6.9 | ARC | Y | Selective crawls with previous permission. Expect to run wholesale UK domain-scale crawls once Legal Deposit legislation is implemented in April 2011. The UKWA is a spin-off from the UK Web Archiving Consortium that ended in 2007. | |||
Hanzo Archives[39] | 7 | WARC | Y | Commercial web archiving services and appliances, for government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. | |||
UK Government Web Archive[40] | 32 | ARC | The UKGWA is a spin-off from the UK Web Archiving Consortium that ended in 2007. | ||||
Internet Archive (provides Archive-it service)[41] | 150000 | 5500 | World-wide | Y | Provides the Archive-it service and leads the Archive-access project (Internet Archive ARC access tools). The collection is mirrored at the Bibliotheca Alexandrina in Egypt. | ||
Reed Technology Web Archiving Services[42] | |||||||
Columbia University Libraries Web Resources Collection Program[43] | 23.1 | 1.8 | ARC/WARC | Y | Selective crawls with permission or notification; primarily thematic collections. | ||
North Carolina State Government Web Site Archives[44] | 51.5 | 3.8 | WARC | Y | |||
Latin American Web Archiving Project[45] | Y | ||||||
Web Archiving Project for the Pacific Islands[46] | 5.5 | ARC/WARC | Y | Includes sites of 18 countries. | |||
Library of Congress Web Archives[47] | 5 | 230 | ARC/WARC | Y | Formerly MINERVA. Selective crawls with notification and permission; primarily event and thematic collections. | ||
Harvard University Library: the Web Archive Collection Service (WAX)[48] | 19 | 0.661 | ARC | Y | Selective crawls with no previous authorization. | ||
Web Archiving Service from California Digital Library (WAS service)[49] | 216 | 25.2 | ARC/WARC | Can be done by partners | Y | Provides the Web Archiving Service (WAS) to partners world-wide. The service was developed at the California Digital Library. | |
University of Michigan Web Archives Project[50] | 0.65 | ARC/WARC | Y | WAS service since 2010. | |||
University of Texas at San Antonio Web Archives[51] | 26 | 1.135 | ARC/WARC | Y | University administration, faculty and student sites; as well as selective captures on San Antonio and South Texas subject areas, including San Antonio organizations; San Antonio Online Journals and Blogs; Tejano and Conjunto music; Gay, Lesbian, Bisexual, Transgender and Queer Related Web sites in Texas, San Antonio and the Rio Grande Valley; Immigration/Borderlands; Mexican Cooking Blogs; San Antonio Restaurants; Renewable Energy in Texas; Rio Grande Valley Organizations; and Rio Grande Watershed and Texas Water Issues . |
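
Several comments in the table above break a reported total into parts (for example, domain crawls plus selective crawls). The following is a minimal Python sketch, not part of the original survey, that checks whether those parts add up to the figures in the "Archived Contents" and "Disk Space Occupied" columns. The dictionary layout and the function names `check_totals` and `dedup_ratio` are illustrative assumptions, and the "Tb" figures quoted for DAFF are assumed to be in the same unit as the TB column.

```python
# Figures copied from the table above: (reported total, parts quoted in Comments).
# The structure and names here are hypothetical, chosen only for illustration.
reported = {
    "Australia's Web Archive, contents (millions)": (3100, [3000, 100]),
    "Australia's Web Archive, disk space (TB)": (104.5, [100, 4.5]),
    "Internet Memory Foundation, disk space (TB)": (180, [140, 40]),
    "New Zealand Web Archive, contents (millions)": (346, [105, 170, 71]),
    "New Zealand Web Archive, disk space (TB)": (13, [4.1, 6.1, 2.8]),
}


def check_totals(entries, tolerance=1e-6):
    """Map each label to True if its parts sum to the reported total."""
    return {
        label: abs(sum(parts) - total) <= tolerance
        for label, (total, parts) in entries.items()
    }


def dedup_ratio(arc_equivalent_tb, stored_tb):
    """Ratio of the compressed-ARC equivalent size to the size actually stored."""
    return arc_equivalent_tb / stored_tb


results = check_totals(reported)
for label, ok in results.items():
    print(f"{label}: {'consistent' if ok else 'inconsistent'}")

# DAFF rows: size on disk versus the quoted compressed-ARC equivalent.
print(f"Ina: {dedup_ratio(665, 56):.1f}x smaller than compressed ARC")
print(f"E-diaspora: {dedup_ratio(10, 2):.1f}x smaller than compressed ARC")
```

A small tolerance is used because sums of decimal fractions such as 4.1 + 6.1 + 2.8 are not guaranteed to be exact in floating point.
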
Name | URL history (Yes/No) | Meta-data (catalog/advanced) search (Yes/No) | Full-text search (Yes/No) | Comments |
---|---|---|---|---|
Australia's Web Archive[1] | N | Y | Y | Selected sites are publicly available through a directory structure. Domain harvests are not. The PANDORA Archive is indexed and searchable through the NLA's single search service Trove.[57] The Australian Domain Harvests are full-text indexed but are not currently publicly available. |
Our digital island, a Tasmanian Web Archive[2] | Y | Y | N | Presents thumbnails generated through Html To Image supplemented in HTTrack. Information is organized in directory: A-Z Subject listing, A-Z Title listing. |
Web@rchive Austria[4] | Y | N | N | Only accessible on special terminals at the Austrian National Library. Presents thumbnail previews of archived pages and supports keyword search within URL. |
DILIMAG (Digital Literature Magazines)[5] | Y | Y | N | Metadata are publicly available; access to the archived versions is free or restricted depending on the agreement with the rights holders. Full-text search was not implemented due to lack of resources. |
Government of Canada Web Archive (GCWA)[6] | Y | Y | Y | Technical details available.[58] |
Web Information Collection and Preservation - WICP (Chinese Web Archive)[7] | Y | Archive content is only available on the intranet of the National Library of China. Some collections are publicly available, with meta-data search and browsing by collection. | ||
Croatian Web Archive (Hrvatski arhiv weba - HAW)[8] | Y | Y | Y | |
WebArchiv (National Library of the Czech Republic)[9] | Y | Y | Due to copyright restrictions, only a limited number of archived websites for which agreements were signed with the publishers is available online. For other resources you can find out whether a given website was archived and the number of harvested versions. Unlimited access to all resources in WebArchiv is available from public terminals in the National Library. | |
Netarkivet.dk[10] | Y | N | N | Online access granted only to researchers using a proxy solution that accesses an archive index. Soon it will set up user access through the Wayback Machine. It has established a framework for running batch jobs with the possibility of data mining. |
Finnish Web Archive[11] | Y | N | 30% of material. | URL search but onsite access to contents. Full-text search is available to 30% of material. |
BnF - BnF Web Legal Deposit[12] | Y | N | 15% of the collection | Accessible to authorized users of the BnF, through the reading rooms of the Research Library located in Paris and Avignon. Wayback Machine interface was translated to French. Full Text search only for a relatively small portion of the collection (15% of 200 TB) indexed by Internet Archive. No current full text search implemented in workflow. Builds special collection galleries based on a selection from the archive on a given topic. |
Ina (Institut National de l'Audiovisuel)[13] | Y | Y | Y | Full text indexing is based on Lucene. To accommodate results from frequent crawls (up to every 2 hours for home pages) clustering is operated to handle similar versions of pages |
E-diaspora (Télécom ParisTech, FMSH)[14] | Y | N | N | 1381 sites are currently crawled to build an archive on migrants' usage of the web. Social studies researchers have launched a long-running project based on this archive (http://ediasporas.ticmigrations.fr/). Ina is handling crawls and storage. |
Internet Memory Foundation (ATN service)[15] | Y | Y | Y | Provides access and search services according to partners policy. |
Bibliotheksservice-Zentrum Baden-Württemberg[16] | Y | Y | Y | Search available (on development).[59] |
Web archive of the German Bundestag[17] | Y | N | N | The web archive itself consists of snapshots of www.bundestag.de and other websites. Navigation is possible by clicking on the years.[60] |
Iceland[18] | ||||
Japan Web Archiving Project[19] | Y | Y | Y | Public access to sites after permission of the site owners. Open access to important publications such as white papers. |
National Library of Korea - OASIS (Online Archiving & Searching Internet Sources)[21] | Y | Y | Y | 100% of the archive is indexed. Enables search by topic classification (e.g. Religion, Science, Arts). Search available.[61] |
Koninklijke Bibliotheek[22] | The web archive will become available online during the first half of the year 2010. | |||
New Zealand Web Archive[24] | Y | Y | N | Domain harvests are available only to selected staff using Wayback and are limited to URL searches. For selective harvests, each website is described in the catalogue (providing subject, author, title and URL searches) and can be viewed by the public via the Internet by clicking on the link to the archived copy. The websites themselves, however, are not indexed. |
The National Library of Norway[25] | N | Y | Sites are integrated in the Catalog. Left bar enables facet navigation with drill-down.[62] | |
Portuguese Web Archive[26] | Y | Y | Y | 20% of the archive is indexed and an experimental full-text service is available. Archived data can be mined through a Hadoop platform. |
Web archive of Čačak[27] | N | N | N | Plans to develop a search engine in the future. One drawback of HTTrack is that it renames files during archiving, so the original structure of the website is lost, as well as the file names. |
Web Archive Singapore[28] | ||||
Slovenian Web Archive[29] | Y | N | N | The archive is not public yet. Plans to implement full-text search. |
Digital Preservation of .ES domain[30] | Y (Future) | Y (Future) | Plan to grant access through computers available at a given hall. | |
Digital Heritage of Catalonia[31] | Y | Y | Y | Full open access. |
Basque Digital Heritage Archive[32] | Y | Y | Y | |
Sweden (Kulturarw3)[33] | Y | N | N | Public access through dedicated machines in the library building. |
Aleph Archives[34] | Y | Y | Y | The full text search engine supports automatic metadata extraction and native results deduplication. Also included: antivirus checker (~250mil. pages/day), archive statistics, text summarizer, archive exports (PDF, PNG, TIFF), etc. |
Web Archive Switzerland[35] | Y (in 2011) | Y (in 2011) | The archived versions of the sites are not yet accessible. Web Archive Switzerland will be open to the public by spring 2011 - only access within the National Library and the partner libraries will be possible. The sites are being catalogued and the records are integrated in our library catalog Helveticat.[63] | |
NTU Web Archiving System, NTUWAS[36] | Y | Y | Y | Presents page thumbnails, archived pages mapped to geographical locations. |
Web Archive Taiwan[37] | Y | Y | Y | |
PageFreezer [3] | Y | Y | Y | Enterprise Class On Demand service to archive and replay websites, blogs, Ajax, Flash, video, audio & social media for litigation protection, eDiscovery and regulatory compliance with FDA, FINRA, FSA, SEC, SOX, Federal Rules of Evidence and records management laws. Used by government agencies and public listed corporations in Pharmaceutical, Food, Finance, Healthcare and Retail industry. |
The UK Web Archive[38] | Y | Y | N | |
Hanzo Archives[39] | Y | Y | Y | Commercial web archiving services and appliances. Access includes full-text search, annotations, redaction, URL/History, archive policy and temporal browsing, and configurable metadata schema for advanced e-discovery applications. Used in government and corporations whose compliance or legal obligations / needs extend to their websites, intranet, and social media. Many 'dark' archives across Europe and USA. |
UK Government Web Archive[40] | Y | Y | Y | Full text search is operational on the UK Government Web Archive.[64] Users can browse the collection using a full A-Z list of all sites[65] and a set of categories.[66] |
Internet Archive (provides Archive-it service)[41] | Y | Y | Y | URL history is available for all archived data. Meta-data and full-text search are available only for selected crawls. Until 2002 it had a mining platform for research composed of Alexa Shell Perl Tools av_tools and p2 platform for parallel processing.[67] It was replaced by a simpler, direct access method that enables automatic access to files but provides no platform for processing.[68] |
Reed Technology Web Archiving Services[42] | ||||
Columbia University Libraries Web Resources Collection Program[43] | Y | Y | Y | Accessible through Archive-it service.[69] |
North Carolina State Government Web Site Archives[44] | Y | Y | Y | Accessible through Archive-it service.[69] |
Latin American Web Archiving Project[45] | Y | Y | Y | Content can be accessed via full-text search, or by browsing by country or by specialized sample collection. |
Web Archiving Project for the Pacific Islands[46] | Y | Y | Y | Supported by Archive-it service. |
Library of Congress Web Archives[47] | Y | Y | N | Access provided via http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html. Records in MODS (Metadata Object Descriptive Schema) format. |
Harvard University Library: the Web Archive Collection Service (WAX)[48] | Y | Y | Y | |
Web Archiving Service from California Digital Library (WAS service)[49] | Y | Y | Y | Access for private study, scholarship and research. Most archives built with WAS have not yet been published because it is up to the partners to decide if they want to provide access. There are 16 partners using the service and they have created over 80 web archives, only 30 are publicly accessible. NutchWAX performance did not permit full archive search. Upcoming transition to SOLR will permit both full archive and collection-specific full text search. |
University of Michigan Web Archives Project[50] | Y | Y | Y | Powered by the WAS from the California Digital Library.[70] Access is public but usage is restricted for private study, scholarship and research. |
University of Texas at San Antonio Web Archives[51] | Y | Y | Y | Accessible through Archive-it service[71] and the Texas Archival Repositories Online database[72] |